11  Data Visualization Fundamentals

One picture is worth a thousand words - Fred R. Barnard

Visual perception offers the highest bandwidth channel, as we acquire much more information through visual perception than with all of the other channels combined, as billions of our neurons are dedicated to this task. Moreover, the processing of visual information is, at its first stages, a highly parallel process. Thus, it is generally easier for humans to comprehend information with plots, diagrams and pictures, rather than with text and numbers. This makes data visualizations a vital part of data science. Some of the key purposes of data visualization are:

  1. Data visualization is the first step towards exploratory data analysis (EDA), which reveals trends, patterns, insights, or even irregularities in data.
  2. Data visualization can help explain the workings of complex mathematical models.
  3. Data visualization are an elegant way to summarise the findings of a data analysis project.
  4. Data visualizations (especially interactive ones such as those on Tableau) may be the end-product of data analytics project, where the stakeholders make decisions based on the visualizations.

11.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the fundamental principles of effective data visualization
  • Classify data types and choose appropriate visualization methods
  • Create basic plots using Pandas’ built-in plotting capabilities
  • Customize sophisticated visualizations with Matplotlib’s pyplot interface
  • Design publication-quality statistical plots using Seaborn
  • Apply best practices for visual storytelling with data
  • Compare different visualization libraries and their use cases

11.2 The Art of Visualization: Choosing the Right Plot Type

There are various types of plots available, and selecting the appropriate one is crucial for successful data visualization. The choice primarily depends on two factors:

  • The type of data you are working with, and
  • The role of visualization in your data analysis

11.2.1 Data Classification for Visualization

Data visualization is commonly used to plot data in a pandas DataFrame. The data can be classified into two categories:

  • Numeric Data: This type of data represents quantities and can take any value within a range. Common examples include age, height, temperature, etc.

  • Categorical Data: This type of data represents distinct categories or groups. It can be nominal (no inherent order, like colors or names) or ordinal (with a defined order, like ratings).

11.2.2 The Role of Visualization in Data Analysis

Data visualization is essential for effectively communicating insights derived from data analysis. By using various visualization techniques, we can uncover patterns, and understand relationships. Below, we discuss different types of data exploration and the relevant visualizations used for each.

11.2.2.1 Univariate Exploration

Purpose: Univariate exploration analyzes a single variable to understand its distribution, central tendency, and spread.

11.2.2.1.1 Common Visualizations:
  • Histograms: Display the frequency distribution of a numeric variable, helping to identify the shape of the data (e.g., normal, skewed).
  • Box Plots: Summarize key statistics of a variable, including median, quartiles, and potential outliers.
  • Bar Plots: Show the count or proportion of categorical variables, revealing the frequency of each category.
  • Line Plots: Used to display trends in numeric data over time, helping to visualize changes in a variable.
11.2.2.1.2 Insights Gained:
  • Identify outliers and anomalies.
  • Understand the range and distribution of values.
  • Determine central tendency (mean, median, mode).

11.2.2.2 Bivariate Analysis

Purpose: Bivariate analysis examines the relationship between two variables, helping to understand how changes in one variable might affect another.

11.2.2.2.1 Common Visualizations:
  • Scatter Plots: Illustrate the relationship between two numeric variables, highlighting trends and correlations.
  • Grouped Bar Plots: Compare categorical variables against a numeric variable, revealing trends across categories.
  • Heatmaps: Represent correlation coefficients between pairs of variables, allowing easy identification of strong correlations.
11.2.2.2.2 Insights Gained:
  • Assess the strength and direction of relationships (positive, negative, or no correlation).
  • Identify potential predictive relationships for further analysis.
  • Discover patterns that may indicate causal relationships.

11.2.2.3 Multivariate Analysis

Purpose: Multivariate analysis investigates more than two variables simultaneously, providing a comprehensive view of complex relationships and interactions.

11.2.2.3.1 Common Visualizations:
  • Pair Plots: Show pairwise relationships in a dataset, facilitating quick insights into correlations among multiple variables.
  • 3D Scatter Plots: Visualize the interaction between three numeric variables in a three-dimensional space.
  • Facet Grids: Display multiple plots for different subsets of data, enabling comparisons across categories.
11.2.2.3.2 Insights Gained:
  • Understand interactions and dependencies among multiple variables.
  • Identify clusters or groups within the data.
  • Enhance predictive modeling by considering multiple influences.

11.2.3 Quick Reference: Data Types and Visualization Mapping

Data Type Examples Best Visualizations
Single Numeric Age, Temperature, Price Histogram, Box Plot, Density Plot
Single Categorical Gender, Region, Product Type Bar Chart, Pie Chart, Count Plot
Time Series Stock prices, Daily sales Line Plot, Area Chart
Two Numeric Height vs Weight, Price vs Sales Scatter Plot, Regression Plot
Numeric + Categorical Sales by Region, Age by Gender Grouped Bar, Box Plot, Violin Plot
Two Categorical Gender vs Product Preference Stacked Bar, Grouped Bar, Heatmap
Three+ Variables Multiple dimensions Pair Plot, Facet Grid, 3D Scatter
Correlations Feature relationships Correlation Matrix, Heatmap

Choosing the appropriate plot depends on the data type and the specific analysis purpose. Numeric data typically requires plots that can handle continuous data (like line plots or histograms), while categorical data often benefits from comparisons (like bar plots or pie charts). Always consider what story you want to tell with your data and select your visualization method accordingly.

11.3 Visualization Tools: The Python Ecosystem

Python offers a rich ecosystem of visualization libraries, each designed for specific purposes. In this course, we’ll focus on three fundamental libraries that form the foundation of data visualization in Python.

11.3.1 The Three-Library Approach

1. Pandas - Quick Exploratory Plots - Built-in .plot() method for DataFrames - Perfect for rapid data exploration - Minimal code required

2. Matplotlib - Complete Control - Low-level plotting library - Maximum customization capability - Foundation for other libraries

3. Seaborn - Statistical Beauty - High-level interface built on Matplotlib - Specialized for statistical visualization - Beautiful defaults out of the box

11.3.2 Why These Three?

Feature Benefit
Complementary Each library excels at different tasks
Integrated They work seamlessly together
Industry Standard Most widely used in data science
Well-Documented Extensive resources and community support
Progressive Learning Start simple (Pandas) → Advanced (Matplotlib/Seaborn)

11.3.3 Library Comparison at a Glance

Pandas Plot
├─ Pros: Fastest to write, DataFrame-native
└─ Cons: Limited customization, basic styling

Matplotlib
├─ Pros: Complete control, publication-quality
└─ Cons: Verbose syntax, requires more code

Seaborn
├─ Pros: Beautiful defaults, statistical functions
└─ Cons: Less control than Matplotlib, opinionated

11.3.4 Step 1: Import Libraries

Let’s start by importing all three libraries with their standard aliases:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# We'll also import numpy for numerical operations
import numpy as np

print("✓ Visualization libraries imported and configured successfully!")
print(f"  - Pandas version: {pd.__version__}")
print(f"  - Matplotlib version: {matplotlib.__version__}")
print(f"  - Seaborn version: {sns.__version__}")
print(f"  - NumPy version: {np.__version__}")
✓ Visualization libraries imported and configured successfully!
  - Pandas version: 2.2.2
  - Matplotlib version: 3.8.4
  - Seaborn version: 0.13.2
  - NumPy version: 1.26.4

11.3.5 Step 2: Understanding the Imports

Let’s understand what each import does:

import pandas as pd - Imports the pandas library for data manipulation - Provides the .plot() method on DataFrames - pd is the universally recognized alias

import matplotlib.pyplot as plt - Imports matplotlib’s pyplot module for plotting - pyplot provides MATLAB-like interface for creating plots - plt is the standard alias used in all documentation

import seaborn as sns - Imports seaborn for statistical visualizations - Built on top of matplotlib with enhanced styling - sns alias comes from the TV show The West Wing

import numpy as np - Not a visualization library, but essential for: - Generating sample data for demonstrations - Performing numerical calculations - Creating arrays for plotting

11.3.6 Step 3: Configure Your Environment

Now let’s set up optimal defaults for our visualization environment:

# Configure matplotlib for better display in Jupyter
%matplotlib inline

# Set default figure size for all plots (width, height in inches)
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 5)  # Larger default size
matplotlib.rcParams['figure.dpi'] = 100          # Resolution
matplotlib.rcParams['font.size'] = 11            # Default font size

# Configure Seaborn style
sns.set_style("whitegrid")                       # Clean grid background
sns.set_palette("deep")                          # Colorblind-friendly palette
sns.set_context("notebook")                      # Appropriate scaling

11.3.7 Step 4: Understanding the Configuration

Let’s break down what each configuration does:

11.3.7.1 Matplotlib Configuration

%matplotlib inline (Jupyter Magic Command)

  • Displays plots directly in the notebook
  • Without this, plots may not appear or open in separate windows
  • Only needed once per notebook session

figure.figsize - Default Plot Dimensions

  • Format: (width, height) in inches
  • Default is (6.4, 4.8) - often too small
  • We set (10, 6) for better visibility
  • Can be overridden for individual plots

figure.dpi - Dots Per Inch (Resolution)

  • Controls image quality and sharpness
  • Default: 72 DPI (screen display)
  • We set 100 DPI for crisper display
  • Use 300 DPI for publication-quality exports

font.size - Text Size

  • Default: 10 points
  • We set 11 for better readability
  • Affects all text: labels, titles, legends

11.3.7.2 Seaborn Configuration

set_style("whitegrid") - Visual Theme

  • Options: darkgrid, whitegrid, dark, white, ticks
  • whitegrid: Clean white background with subtle grid
  • Professional appearance suitable for presentations

set_palette("deep") - Color Scheme

  • Options: deep, muted, pastel, colorblind, etc.
  • deep: Saturated colors with good contrast
  • Automatically applied to Seaborn plots

set_context("notebook") - Scaling Preset

  • Options: paper, notebook, talk, poster
  • Controls relative size of elements
  • notebook: Optimal for Jupyter display
  • Use talk for presentations, poster for large displays

11.3.8 Step 5: Verify Your Setup

Let’s create a quick test to confirm everything is working correctly:

# Quick test plot to verify configuration
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Test 1: Pandas plot
test_data = pd.DataFrame({
    'x': range(10),
    'y': np.random.randn(10).cumsum()
})
test_data.plot(x='x', y='y', ax=axes[0], title='Pandas Plot', legend=False)
axes[0].set_ylabel('Value')

# Test 2: Matplotlib plot
axes[1].set_title('Matplotlib Plot')
axes[1].set_xlabel('X axis')
axes[1].set_ylabel('Y axis')
axes[1].grid(True, alpha=0.3)

# Test 3: Seaborn plot
test_df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D', 'E'],
    'values': np.random.randint(10, 50, 5)
})
sns.barplot(data=test_df, x='category', y='values', ax=axes[2])
axes[2].set_title('Seaborn Plot')

plt.tight_layout()
plt.suptitle('All Three Libraries Working!', fontsize=14, y=1.02)
plt.show()

print("\n✓ All libraries are working correctly!")
print("  You're ready to start creating visualizations!")


✓ All libraries are working correctly!
  You're ready to start creating visualizations!

11.3.9 Step 6: Pro Tips for Working with These Libraries

Now that everything is set up, here are some professional tips for effective visualization:

11.3.9.1 Layer Your Learning

Start simple and progressively add complexity:

  • Begin: Use Pandas .plot() for quick data exploration
  • Enhance: Add Matplotlib customization for specific needs
  • Refine: Use Seaborn for statistical visualizations
  • Combine: Mix all three in complex projects

11.3.9.2 Common Workflow Pattern

Here’s a typical pattern that combines all three libraries:

# Step 1: Create base plot with Seaborn (easy syntax + beautiful defaults)
sns.scatterplot(data=df, x='var1', y='var2', hue='category')

# Step 2: Customize with Matplotlib (fine-grained control)
plt.title('My Custom Title', fontsize=14, fontweight='bold')
plt.xlabel('Custom X Label')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)

# Step 3: Display
plt.show()

11.3.9.3 Troubleshooting Quick Reference

Problem Solution
Plots not showing Add plt.show() or verify %matplotlib inline is set
Plots overlapping Use plt.figure() before each new plot or plt.clf() to clear
Text too small Increase font.size in rcParams or use fontsize parameter
Poor image quality Save with higher DPI: plt.savefig('plot.png', dpi=300)
Colors hard to distinguish Use colorblind-friendly palette: sns.set_palette("colorblind")
Legend outside plot area Adjust with bbox_inches='tight' when saving

11.3.9.4 Performance Tips for Large Datasets

When working with datasets >100K points:

  • Sampling: Plot a representative random subset
  • Aggregation: Group/bin data before plotting
  • Rasterization: Use rasterized=True in scatter plots
  • Alternative libraries: Consider Plotly or Bokeh for interactive visualizations

11.3.9.5 Saving High-Quality Plots

Professional way to save your visualizations:

plt.savefig('my_plot.png', 
            dpi=300,                    # Publication quality (300 DPI)
            bbox_inches='tight',        # Remove extra whitespace
            facecolor='white',          # Ensure white background
            edgecolor='none',           # No border
            transparent=False)          # Solid background

11.3.10 You’re Ready!

Your visualization environment is now fully configured and tested. In the following sections, we’ll explore each library in detail, starting with Pandas for quick exploratory analysis.

11.4 Basic Plotting with Pandas

In previous chapters, we focused on using pandas for data reading and analysis. In addition to its powerful data manipulation capabilities, pandas also provides built-in plotting tools that make it especially valuable for:

  • Rapid Exploration - Create plots with minimal code
  • Seamless Integration - No data conversion needed
  • Quick Insights - Perfect for initial data investigation
  • Iterative Analysis - Easy to modify and regenerate

The Power of .plot(): Pandas’ .plot() method is your Swiss Army knife for quick visualizations. It’s built on Matplotlib but provides a simpler, DataFrame-centric interface.

11.4.1 📂 Dataset: COVID-19 Cases

In this section, we’ll use real COVID-19 data to demonstrate practical plotting techniques. This dataset contains time series information about new cases and deaths, making it ideal for learning data visualization.

covid_df = pd.read_csv('./Datasets/covid.csv')
covid_df.head(5)
date new_cases total_cases new_deaths total_deaths new_tests total_tests cases_per_million deaths_per_million tests_per_million
0 2019-12-31 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 NaN
1 2020-01-01 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 NaN
2 2020-01-02 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 NaN
3 2020-01-03 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 NaN
4 2020-01-04 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 NaN

11.4.2 Step 1: Your First Pandas Plot

Let’s start with the simplest possible plot - visualizing a single variable.

Concept: Line Plots for Time Series Line plots are ideal for showing changes over continuous data, such as the progression of new cases over a series of dates.

The Simplest Plot: With pandas, creating a plot is as easy as calling .plot() on a Series or DataFrame:

covid_df.new_cases.plot()

Problem Identified: 🤔

While this plot shows the overall trend, it’s hard to tell when the peak occurred - the X-axis shows indices (0, 1, 2…) instead of dates!

Solution: Set the Date as Index

Since this is a time series dataset, we can use the date column as the DataFrame index. This tells pandas to use actual dates on the X-axis:

covid_df.set_index('date', inplace=True)
covid_df.new_cases.plot(rot=45);

Much Better!

Now we can clearly see that the peak occurred around March 2020. The rot=45 parameter rotates the X-axis labels 45 degrees for better readability.

Key Insight: When working with time series data, always set the date/time column as the index for automatic, proper time-axis formatting.

11.4.3 Step 2: Plotting Multiple Variables

Now that we have one variable plotted, let’s compare multiple variables on the same plot to see relationships.

Goal: Compare new cases and new deaths trends over time

How: Simply pass a list of column names to the DataFrame:

covid_df[['new_deaths', 'new_cases']].plot();

What Happened?

Pandas automatically:

  • Created two lines (one per column)
  • Assigned different colors to each line
  • Generated a legend with column names
  • Used the index (dates) for the X-axis

11.4.4 Step 3: Customizing Your Plots

By default, pandas generates line plots using the .plot() method. However, you can adjust several parameters to enhance the appearance:

Common Customization Parameters:

Parameter Purpose Example
figsize Set figure dimensions (width, height) in inches figsize=(12, 6)
linewidth or lw Control line thickness linewidth=2
marker Add data point markers marker='o'
color Set line color(s) color='red' or color=['red', 'blue']
alpha Set transparency (0=transparent, 1=opaque) alpha=0.7
rot Rotate X-axis labels rot=45
grid Show/hide gridlines grid=True
title Add plot title title='My Plot'

Let’s enhance our plot:

covid_df[['new_deaths', 'new_cases']].plot(figsize=(12, 6), linewidth=2, marker='o');

Result: The plot is now larger, has thicker lines, and shows markers at each data point - much easier to read!

11.4.5 Step 4: Different Plot Types

So far, we’ve only created line plots (the default). However, pandas supports 11 different plot types through the kind parameter:

kind Value Plot Type Best For
'line' Line plot (default) Trends over time, continuous data
'scatter' Scatter plot Relationships between two variables
'bar' Vertical bar chart Comparing categories
'barh' Horizontal bar chart Comparing categories (long labels)
'hist' Histogram Distribution of a single variable
'box' Box plot Distribution summary, outlier detection
'kde' or 'density' Kernel Density Estimate Smooth distribution visualization
'area' Area plot Cumulative trends
'pie' Pie chart Part-to-whole relationships
'hexbin' Hexagonal bin plot Dense scatter plot patterns

Let’s explore the most common plot types:

11.4.5.1 Scatter Plot: Finding Relationships

Question: Is there a correlation between new cases and new deaths?

Best Tool: Scatter plot - shows the relationship between two numerical variables

Let’s next create a scatter plot to visualize the relationship between new cases and new deaths, and explore whether there’s a correlation between them.

covid_df.plot(kind='scatter', x='new_cases', y='new_deaths', color='r', alpha=0.5);

Observation: There appears to be a positive correlation - as new cases increase, new deaths tend to increase as well. The alpha=0.5 parameter makes points semi-transparent, helping us see overlapping data points.

11.4.5.2 Histogram: Understanding Distribution

Question: What’s the typical range of daily deaths? Are there outliers?

Best Tool: Histogram - shows the frequency distribution of a single variable

covid_df.new_deaths.plot(kind='hist', color='r', alpha=0.5, bins=50);

Observation: Most days have relatively few deaths (left-skewed distribution), with occasional days having much higher counts. The bins=50 parameter creates 50 bins for finer detail.

11.4.5.3 Box Plot: Spotting Outliers

Question: What’s the median, quartiles, and outliers for new cases?

Best Tool: Box plot - summarizes distribution and highlights outliers

covid_df.new_cases.plot(kind='box');

Observation: The box shows the median (orange line), 25th-75th percentiles (box), and outliers (circles). We can see several high-outlier days with exceptionally high case counts.

11.4.6 Step 5: Choosing the Right Plot Type

Not sure which plot to use? Follow this decision guide:

📈 WANT TO SHOW TRENDS OVER TIME?
   └─> Use: kind='line' (default)
   └─> Example: Stock prices, temperature changes

🔍 WANT TO SHOW RELATIONSHIPS BETWEEN TWO VARIABLES?
   └─> Use: kind='scatter'
   └─> Example: Height vs. weight, study time vs. grades

📊 WANT TO SHOW DISTRIBUTION OF ONE VARIABLE?
   └─> Use: kind='hist' or kind='box'
   └─> Histogram: See the shape of distribution
   └─> Box plot: See median, quartiles, and outliers

📊 WANT TO COMPARE CATEGORIES?
   └─> Use: kind='bar' or kind='barh'
   └─> Example: Sales by region, counts by category

🥧 WANT TO SHOW PARTS OF A WHOLE?
   └─> Use: kind='pie'
   └─> Warning: Use sparingly - often harder to read than bar charts!

11.4.7 Additional Resources

For more plot types and detailed information, refer to the official pandas documentation:

11.4.8 Summary: Pandas Plotting Strengths & Limitations

11.4.8.1Strengths (When to Use Pandas):

  1. Speed - Create plots in 1-2 lines of code
  2. Integration - Works directly with DataFrames (no conversion)
  3. Exploration - Perfect for quick data checks
  4. Learning Curve - Easiest to learn for beginners
  5. Iteration - Rapidly test different visualizations

Best Use Cases: - Rapid data exploration during analysis - Quick sanity checks during data cleaning - Initial pattern recognition - Jupyter notebook investigations

11.4.8.2 ⚠️ Limitations (When to Move Beyond Pandas):

  1. Limited Aesthetics - Basic styling, not publication-ready
  2. Simple Layouts - Hard to create multi-panel figures
  3. Customization - Fine-grained control requires Matplotlib
  4. Statistical Plots - No built-in statistical visualizations
  5. Plot Variety - Missing advanced plot types

When to Switch Libraries:

  • → Matplotlib: Need fine-grained control, custom layouts, publication quality
  • → Seaborn: Need statistical plots, beautiful defaults, correlation matrices
  • → Plotly: Need interactive plots, dashboards, 3D visualizations

11.4.9 Practice Exercise: Apply Your Skills

Now it’s your turn! Try creating these plots with the COVID dataset:

Exercise 1: Basic Line Plot

# Plot new_cases with:
# - Figure size of (14, 6)
# - Green color
# - Title "COVID-19 New Cases Over Time"

Exercise 2: Comparison Plot

# Plot new_cases and new_deaths together with:
# - Different line styles (solid and dashed)
# - Add a grid

Exercise 3: Distribution Analysis

# Create a histogram of new_cases with:
# - 30 bins
# - Blue color with 50% transparency

Exercise 4: Relationship Analysis

# Create a scatter plot showing new_cases vs new_deaths
# Add appropriate axis labels

💡 Tip: Refer to the DataFrame.plot documentation to find the parameters you need!

11.5 Data Plotting with Matplotlib pyplot Interface

You’ve just learned to create plots with pandas’ .plot() method. But here’s an important insight: Pandas plotting is built on top of Matplotlib!

When you call df.plot(), pandas is actually calling Matplotlib functions behind the scenes. Think of it like this:

┌─────────────────────────────────────┐
│     Pandas .plot()                  │  ← High-level, easy, limited control
├─────────────────────────────────────┤
│     Matplotlib (pyplot)             │  ← Low-level, powerful, full control
└─────────────────────────────────────┘

Why Learn Matplotlib Directly?

While Pandas is great for quick plots, Matplotlib gives you:

  • Maximum Control - Customize every element of your plots
  • Complex Layouts - Create multi-panel figures and subplots
  • Publication Quality - Professional scientific visualizations
  • Foundation Knowledge - Understanding how visualization really works in Python
  • Flexibility - Work with any data structure (lists, arrays, DataFrames)

11.5.1 What is Matplotlib?

Matplotlib is:

  • A comprehensive library for creating static, animated, and interactive visualizations in Python
  • Designed to emulate MATLAB’s plotting interface (hence the name)
  • The foundation for many other visualization libraries (Pandas, Seaborn, etc.)
  • Compatible with Python scripts, IPython shells, Jupyter notebooks, and web servers
  • Mostly written in Python, with some segments in C, Objective-C, and JavaScript for platform compatibility

11.5.2 Step 1: Understanding pyplot

What is pyplot?

pyplot is a module within Matplotlib that provides a state-based interface for creating plots. It’s the most commonly used part of Matplotlib.

Key Concepts:

  • Matplotlib = The entire plotting library/package
  • pyplot = A specific module within Matplotlib that makes plotting easier
  • plt = The conventional alias used when importing pyplot

The State-Based Interface:

pyplot maintains an internal current figure and axes. When you call functions like plt.plot(), plt.xlabel(), etc., they automatically apply to the current figure. This makes it easy to build plots step by step without explicitly managing figure objects.

Import Convention:

The standard way to import pyplot is:

import matplotlib.pyplot as plt

Data Sources:

pyplot can work with various data types: - Python lists - NumPy arrays - Pandas Series and DataFrames

Note: Internally, all sequences are converted to NumPy arrays for processing.

11.5.3 Step 2: Your First pyplot Plot

Let’s start with the simplest possible plot using Python lists to illustrate basic plotting with Matplotlib pyplot:

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

Create the Plot:

Now let’s visualize this data with a simple line plot:

plt.plot(yield_apples);

What Just Happened?

Calling plt.plot() creates a line chart showing the trend in apple yields.

The Semicolon Trick:

You might notice the semicolon (;) at the end of the command. Without it, Matplotlib returns a text representation like [<matplotlib.lines.Line2D at 0x2194b571df0>] before displaying the plot. The semicolon suppresses this output, showing only the graph.

With semicolon (cleaner output):

plt.plot(yield_apples);

Problem: The X-axis shows list indices (0, 1, 2, 3, 4, 5) instead of meaningful values like years.

Solution: Provide both X and Y data to plt.plot().

11.5.4 Step 3: Customizing Axes

Let’s make our plot more informative by specifying custom X-axis values. We’ll use years instead of indices:

years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
plt.plot(years, yield_apples);

Much Better! Now the X-axis shows actual years, making the plot meaningful.

Syntax: plt.plot(x_values, y_values)

11.5.5 Step 4: Plotting Multiple Lines

You can create multiple lines on the same plot by calling plt.plot() multiple times. Each call adds another line to the current figure.

Use Case: Let’s compare the yields of apples vs. oranges over time.

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896, ]
plt.plot(years, apples)
plt.plot(years, oranges);

Result: Matplotlib automatically assigns different colors to each line and displays both on the same axes.

Default Settings:

When plt.plot() is called without formatting parameters, pyplot uses these defaults:

Property Default Value
Figure size 6.4 × 4.8 inches
Line style Solid line
Line width 1.5
First line color Blue (#1f77b4)
Subsequent colors Automatic color cycle

Customizing Global Defaults:

You can change default settings for all future plots using matplotlib.rcParams:

import matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (7, 4)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Note: While we’re focusing on the pyplot interface in this section, Matplotlib also offers an object-oriented interface that provides even more control. We’ll explore that in the next chapter.

For more on customizing defaults, see: Customizing Matplotlib with rcParams

11.5.6 Step 5: Adding Labels, Titles, and Legends

A good plot needs clear labels so readers understand what they’re looking at. Let’s enhance our plot with descriptive text elements.

Every Complete Plot Should Have:

  • Title - What the plot shows
  • Axis Labels - What each axis represents (with units!)
  • Legend - Which line is which (when plotting multiple series)

11.5.6.1 Adding Axis Labels

Let’s start by adding labels to our axes using plt.xlabel() and plt.ylabel():

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

Good! But which line is which? Let’s add a title and legend.

11.5.6.2 Adding Title and Legend

Use plt.title() for the title and plt.legend() to identify each line:

plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Perfect! Now our plot is fully labeled and easy to understand.

11.5.7 Step 6: Styling Lines and Markers

Matplotlib offers extensive customization for how lines and markers appear. Let’s explore the options.

11.5.7.1 Adding Markers

Show data points explicitly using the marker parameter. Matplotlib provides many marker styles:

  • 'o' = circle
  • 'x' = X mark
  • 's' = square
  • '^' = triangle up
  • '*' = star
  • '+' = plus sign

See the full list of markers.

plt.plot(years, apples, marker='o')
plt.plot(years, oranges, marker='x')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Great! The markers make it easier to see individual data points.

11.5.7.2 Complete Styling Options

The plt.plot() function supports many styling arguments:

Parameter Alias Purpose Example
color c Set line color color='red' or c='#FF0000'
linestyle ls Line style (solid, dashed, etc.) linestyle='--' or ls=':'
linewidth lw Line thickness linewidth=2 or lw=3
marker - Marker style marker='o'
markersize ms Marker size markersize=10 or ms=8
markeredgecolor mec Marker edge color mec='navy'
markeredgewidth mew Marker edge thickness mew=2
markerfacecolor mfc Marker fill color mfc='lightblue'
alpha - Transparency (0=invisible, 1=opaque) alpha=0.7

Documentation: plt.plot() reference

Example with Multiple Styling Parameters:

plt.plot(years, apples, marker='s', c='b', ls='-', lw=2, ms=8, mew=2, mec='navy')
plt.plot(years, oranges, marker='o', c='r', ls='--', lw=3, ms=10, alpha=.5)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Impressive! We’ve created a highly customized plot with thick lines, custom colors, different markers, and transparency.

11.5.7.3 The fmt Shorthand

For quick styling, Matplotlib provides the fmt string format as a shorthand:

Syntax: fmt = '[marker][line][color]'

Common Examples:

  • 'o-r' = circle markers, solid line, red color
  • 'x--b' = X markers, dashed line, blue color
  • '^:g' = triangle markers, dotted line, green color
  • 'sb' = square markers, no line, blue color

Color Codes:

  • 'b' = blue, 'g' = green, 'r' = red, 'c' = cyan
  • 'm' = magenta, 'y' = yellow, 'k' = black, 'w' = white

Line Styles:

  • '-' = solid, '--' = dashed, ':' = dotted, '-.' = dash-dot

Example:

plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

Much simpler code! The fmt string 's-b' means square markers, solid line, blue and 'o--r' means circle markers, dashed line, red.

11.5.7.4 Markers Without Lines

If you don’t specify a line style in fmt, only markers are drawn (no connecting lines):

plt.plot(years, apples, 'sb')
plt.plot(years, oranges, 'or')
plt.title("Yield (tons per hectare)");

Result: Only markers, no lines. Perfect for scatter-like visualizations!

11.5.8 Step 7: Controlling Figure Size

By default, Matplotlib creates figures that are 6.4 × 4.8 inches. You can change this using plt.figure(figsize=(width, height)):

plt.figure(figsize=(6, 4))

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");

Important: Call plt.figure() BEFORE plt.plot() to set the size for that specific figure.

11.5.9 Step 8: Other Plot Types with pyplot

So far we’ve focused on line plots, but pyplot supports many other visualization types. Let’s explore them using a different dataset.

New Dataset: FIFA Player Data

Let’s load FIFA player statistics to demonstrate various plot types:

fifa = pd.read_csv('./Datasets/fifa_data.csv')
fifa.head(5)
Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94 94 FC Barcelona ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Juventus ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 2 190871 Neymar Jr 26 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 93 Paris Saint-Germain ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 3 193080 De Gea 27 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91 93 Manchester United ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 4 192985 K. De Bruyne 27 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91 92 Manchester City ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M

5 rows × 89 columns

11.5.9.1 Histogram: Distribution of Values

Use Case: Understanding the distribution of player skill levels

Function: plt.hist(data, bins=number, color='color')

plt.figure(figsize=(8,5))

plt.hist(fifa.Overall, color='#abcdef')

plt.ylabel('Number of Players')
plt.xlabel('Skill Level')
plt.title('Distribution of Player Skills in FIFA 2018')
Text(0.5, 1.0, 'Distribution of Player Skills in FIFA 2018')

Observation: Player skills follow a roughly normal distribution, with most players having moderate skills (60-70 range).

11.5.9.2 Bar Chart: Comparing Categories

Use Case: Comparing counts between categories

Function: plt.bar(categories, values, color='color')

# plotting bar chart for the best players
plt.figure(figsize=(8,5))

foot_preference = fifa['Preferred Foot'].value_counts()

plt.bar(['Left', 'Right'], [foot_preference.iloc[1], foot_preference.iloc[0]], color='#abcdef')

plt.ylabel('Number of Players')
plt.title('Foot Preference of FIFA Players');

Observation: More FIFA players prefer their right foot over their left.

11.5.9.3 Pie Chart: Parts of a Whole

Use Case: Showing proportions/percentages

Function: plt.pie(values, labels=labels, autopct='format')

left = fifa.loc[fifa['Preferred Foot'] == 'Left'].count().iloc[0]
right = fifa.loc[fifa['Preferred Foot'] == 'Right'].count().iloc[0]

plt.figure(figsize=(8,5))

labels = ['Left', 'Right']
colors = ['#abcdef', '#aabbcc']

plt.pie([left, right], labels = labels, colors=colors, autopct='%.2f %%')

plt.title('Foot Preference of FIFA Players');

Tip: The autopct='%.2f %%' parameter automatically calculates and displays percentages.

11.5.9.4 Advanced Pie Chart with Explode

You can explode (pull out) specific slices to emphasize them:

plt.figure(figsize=(8,5), dpi=100)

plt.style.use('ggplot')

fifa.Weight = [int(x.strip('lbs')) if type(x)==str else x for x in fifa.Weight]

light = fifa.loc[fifa.Weight < 125].count().iloc[0]
light_medium = fifa[(fifa.Weight >= 125) & (fifa.Weight < 150)].count().iloc[0]
medium = fifa[(fifa.Weight >= 150) & (fifa.Weight < 175)].count().iloc[0]
medium_heavy = fifa[(fifa.Weight >= 175) & (fifa.Weight < 200)].count().iloc[0]
heavy = fifa[fifa.Weight >= 200].count().iloc[0]

weights = [light,light_medium, medium, medium_heavy, heavy]
label = ['under 125', '125-150', '150-175', '175-200', 'over 200']
explode = (.4,.2,0,0,.4)

plt.title('Weight of Professional Soccer Players (lbs)')

plt.pie(weights, labels=label, explode=explode, pctdistance=0.8,autopct='%.2f %%');

Advanced Features:

  • explode parameter pulls out slices (0 = normal, 0.4 = pulled out significantly)
  • plt.style.use('ggplot') applies a different visual style
  • pctdistance controls where percentages are positioned

11.5.9.5 Box Plot: Distribution Summary

Use Case: Comparing distributions across groups, identifying outliers

Function: plt.boxplot(data_list, tick_labels=labels)

plt.figure(figsize=(5,8), dpi=100)

plt.style.use('default')

barcelona = fifa.loc[fifa.Club == "FC Barcelona"]['Overall']
madrid = fifa.loc[fifa.Club == "Real Madrid"]['Overall']
revs = fifa.loc[fifa.Club == "New England Revolution"]['Overall']

# bp = plt.boxplot([barcelona, madrid, revs], labels=['a','b','c'], boxprops=dict(facecolor='red'))
bp = plt.boxplot([barcelona, madrid, revs], labels=['FC Barcelona','Real Madrid','NE Revolution'], patch_artist=True, medianprops={'linewidth': 2})

plt.title('Professional Soccer Team Comparison')
plt.ylabel('FIFA Overall Rating')

for box in bp['boxes']:
    # change outline color
    box.set(color='#4286f4', linewidth=2)
    # change fill color
    box.set(facecolor = '#e0e0e0' )
    # change hatch
    #box.set(hatch = '/')

Box Plot Elements:

  • Box = 25th to 75th percentile (middle 50% of data)
  • Orange line = Median
  • Whiskers = Extend to show data range
  • Circles = Outliers

Observation: FC Barcelona and Real Madrid have higher overall player ratings compared to New England Revolution.

11.5.9.6 Strengths of Matplotlib pyplot

  • Maximum Control - Customize every element
  • Publication Quality - Professional output suitable for papers
  • Extensive Plot Types - Line, scatter, bar, histogram, pie, box, and many more
  • Well Documented - Large community and comprehensive documentation
  • Foundation - Base for Pandas, Seaborn, and other libraries

11.5.9.7 Essential Best Practices

  1. Set Figure Size Early

    plt.figure(figsize=(width, height))  # Before plt.plot()
  2. Always Label Your Axes

    plt.xlabel('X-axis label (with units!)')
    plt.ylabel('Y-axis label (with units!)')
  3. Add Descriptive Titles

    plt.title('Clear description of what the plot shows')
  4. Include Legends for Multiple Lines

    plt.legend(['Series 1', 'Series 2'])
  5. Use Semicolons in Jupyter

    plt.plot(x, y);  # Suppresses unwanted text output
  6. Save High-Quality Figures

    plt.savefig('filename.png', dpi=300, bbox_inches='tight')

11.5.9.8 When to Use pyplot

Best For:

  • Creating custom visualizations with specific requirements
  • Building complex multi-panel layouts (covered in next chapter)
  • Fine-tuning every visual element
  • Generating publication-ready figures
  • When you need maximum control

11.5.10 What’s Next?

You’ve mastered the pyplot interface - the state-based, MATLAB-style approach to plotting.

In the next chapter, we’ll explore Matplotlib’s Object-Oriented Interface, which gives you even more control and is better for:

  • Creating complex multi-panel figures
  • Writing reusable plotting functions
  • Managing multiple figures simultaneously
  • Professional data visualization workflows

Preview of OO Interface:

# pyplot style (what we just learned)
plt.plot(x, y)
plt.xlabel('X Label')

# Object-oriented style (next chapter)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('X Label')

The OO interface is the professional standard for complex visualizations, but everything you learned about styling, colors, and plot types applies equally to both interfaces!

11.6 Plotting with Seaborn

Seaborn is a powerful data visualization library built on top of Matplotlib, designed to make statistical plots easier and more attractive. It provides:

  • High-level interface for drawing attractive statistical graphics
  • Beautiful default styles that are publication-ready out of the box
  • Seamless Pandas integration with direct DataFrame support
  • Built-in statistical computations (confidence intervals, regression lines, etc.)
  • Specialized plot types for statistical analysis

Think of it this way:

┌─────────────────────────────────────┐
│     Pandas .plot()                  │  ← Quickest, least control
├─────────────────────────────────────┤
│     Seaborn                         │  ← Beautiful + Statistical
├─────────────────────────────────────┤
│     Matplotlib                      │  ← Most control, most verbose
└─────────────────────────────────────┘

11.6.1 Step 1: Why Choose Seaborn Over Matplotlib?

Seaborn addresses several pain points of using Matplotlib directly:

11.6.1.1 Problem 1: Default Aesthetics

Matplotlib Challenge: - Default styles can look dated - Requires manual styling for professional appearance - Grid, colors, and fonts need explicit configuration

Seaborn Solution: - Modern, publication-ready styles out of the box - Professional appearance with zero configuration - Multiple preset themes for different contexts

11.6.1.2 Problem 2: Statistical Visualizations

Matplotlib Challenge: - Requires extensive code for statistical plots - Manual calculation of confidence intervals - Complex code for regression lines and error bars

Seaborn Solution: - Built-in statistical functions - Automatic confidence interval computation - One-line regression plots with regplot()

11.6.1.3 Problem 3: DataFrame Integration

Matplotlib Challenge: - Works primarily with arrays and lists - Requires extracting columns manually - No automatic label generation from column names

Seaborn Solution: - Native DataFrame support via data parameter - Column names automatically become labels - Natural, intuitive syntax: x='column_name'

11.6.1.4 Problem 4: Complex Multi-Panel Plots

Matplotlib Challenge: - Subplots require significant boilerplate code - Manual management of figure and axes objects - Difficult to create consistent multi-plot layouts

Seaborn Solution: - FacetGrid for automatic multi-panel layouts - PairPlot for pairwise relationship matrices - Simplified API for complex visualizations

11.6.2 Step 2: Setup and Aesthetic Customization

Before creating plots, let’s learn how to configure Seaborn’s appearance.

First, let’s import Seaborn and explore its built-in datasets:

Seaborn Datasets:

Seaborn comes with 17 built-in datasets, perfect for learning and practice. This means you can focus on visualization techniques without spending time finding and cleaning data.

import seaborn as sns
# get names of the builtin dataset
sns.get_dataset_names()
['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

Great! Now you can use any of these datasets with sns.load_dataset('dataset_name').

11.6.2.1 Customizing Plot Aesthetics

Seaborn provides powerful functions to customize the visual appearance of plots:

Style Options:

Seaborn offers five built-in styles via sns.set_style():

Style Description Best For
"whitegrid" White background with grid lines Most presentations and reports (recommended)
"darkgrid" Dark background with grid lines Data with many points or intricate details
"white" Simple white background, no grid Minimalist aesthetic, emphasis on data
"dark" Dark background without grid Emphasizing data points, reducing distractions
"ticks" White background with axis ticks Adding detail for precise reference

Let’s apply a style:

sns.set_style("whitegrid")

Perfect! Now all our Seaborn plots will have a clean, professional appearance with white background and gridlines.

11.6.3 Step 3: Distribution Plots

Distribution plots help you understand how your data is spread out. Let’s explore using the famous Iris flower dataset.

Load the Dataset:

# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")
flowers_df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Dataset: The Iris dataset contains measurements of 150 iris flowers from three different species. Let’s visualize the relationships between features.

11.6.4 Step 4: Relational Plots - Scatter Plots

Scatter plots show relationships between two numerical variables. Seaborn’s scatterplot() makes this easy.

11.6.4.1 Basic Scatter Plot

sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);

Observation: The plot shows a general pattern, but we can see distinct clusters. This suggests different groups might exist in the data.

11.6.4.2 Adding a Third Dimension with Hue

The real power of Seaborn comes from easily adding additional dimensions to your plots. The hue parameter colors points based on a categorical variable.

First, let’s check what species we have:

flowers_df.species.unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=100);

Much more informative! Now we can clearly see:

  • Setosa flowers have smaller sepal length but larger sepal width
  • Virginica flowers show the opposite pattern (longer, narrower sepals)
  • Versicolor falls in between
  • The three species are clearly separable based on these measurements

Key Insight: The hue parameter is one of Seaborn’s most powerful features - it adds a third dimension to 2D plots with automatic color coding and legend generation.

11.6.4.3 Integrating Seaborn with Matplotlib

Since Seaborn is built on Matplotlib, you can use Matplotlib functions to customize Seaborn plots further:

plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=100);

Perfect combination! Seaborn creates beautiful plots quickly, and Matplotlib adds fine-tuning.

11.6.5 Step 5: Distribution Plots - Histograms

Histograms show the frequency distribution of a single variable. Let’s visualize sepal width distribution.

11.6.5.1 Basic Histogram

sns.histplot(data=flowers_df, x='sepal_width');

Observation: The distribution appears roughly bell-shaped (normal), with most values around 3.0.

11.6.5.2 Adding KDE (Kernel Density Estimate)

KDE creates a smooth curve showing the distribution’s shape:


sns.histplot(data=flowers_df, x='sepal_width', kde=True);

Enhanced! The smooth KDE curve helps visualize the overall distribution shape.

11.6.5.3 Comparing Distributions with Hue

Use hue to compare distributions across categories:

# adding hue
sns.histplot(data=flowers_df, x="sepal_width", hue="species");

Insight: Different species have different sepal width distributions:

  • Setosa tends to have wider sepals (peak around 3.4)
  • Versicolor and Virginica have narrower sepals (peaks around 2.8-3.0)

11.6.6 Step 6: Categorical Plots - Bar Plots

Bar plots show aggregate statistics (like mean) for different categories. Let’s use the Tips dataset.

Load Tips Dataset:

Dataset: Tips received by restaurant servers, including day, time, party size, and whether the customer smoked.

11.6.6.1 Basic Bar Plot

By default, barplot() shows the mean value with confidence interval error bars:

tips_df = sns.load_dataset("tips")
sns.barplot(x='day', y='total_bill', data=tips_df);

Observation: Weekend (Saturday/Sunday) bills tend to be higher on average than weekday bills.

Note: The black lines are confidence intervals showing the uncertainty in the mean estimate.

11.6.6.2 Adding Hue for Comparison

Compare tips by gender across different days:

sns.barplot(x='day', y='tip', hue='sex', data=tips_df);

Insight: Males tend to give slightly higher tips across most days.

11.6.6.3 Horizontal Bar Charts

Simply swap the x and y parameters to make bars horizontal (useful for long category names):

# make the bars horizontal simply by switching the axes
sns.barplot(x='tip', y='day', hue='sex', data=tips_df);

Horizontal bars are easier to read when you have many categories or long labels.

11.6.7 Step 7: Box Plots - Distribution Summary

Box plots provide a statistical summary of distributions and are excellent for comparing groups and identifying outliers.

11.6.7.1 Understanding Box Plots

Box plots display five key statistics that describe a distribution:

Box Plot Elements: - Box = 25th to 75th percentile (middle 50% of data - the interquartile range) - Line inside box = Median (50th percentile) - Whiskers = Extend to show the data range (typically 1.5 × IQR) - Individual points = Outliers (unusually high or low values)

11.6.7.2 Creating a Box Plot

Compare total bills across days, separated by gender:

sns.boxplot(data=tips_df, y='total_bill', x='day', hue='sex');

What can we observe from this box plot?

Key Insights from the Box Plot:

  1. Weekend vs Weekday Patterns:

    • Saturday and Sunday show higher median total bills (thicker middle line)
    • Thursday has the lowest median bills
  2. Gender Differences:

    • Males generally have higher median total bills than females across all days
    • The difference is most pronounced on weekends
  3. Variability:

    • Saturday shows the highest variability (tallest box and longest whiskers)
    • Thursday shows the most consistent bills (shortest box)
  4. Outliers:

    • Several outlier points visible on most days (individual dots above whiskers)
    • These represent unusually large bills
    • Outliers are more common on weekends
  5. Distribution Shape:

    • Most distributions are slightly right-skewed (median closer to bottom of box)
    • This suggests occasional very high bills pull the mean up

11.6.8 Step 8: Matrix Visualizations - Heatmaps

Heatmaps are excellent for visualizing 2D data matrices, especially for showing patterns in tabular data.

11.6.8.1 Understanding Heatmaps

Heatmaps represent 2-dimensional data (like a matrix or table) using colors. Darker/brighter colors indicate higher or lower values.

Use Case: Let’s visualize airline passenger data to see patterns over time.

Dataset Structure: Rows represent months, columns represent years, values show passenger counts (in thousands).

Note: The pivot() function restructures data for heatmap visualization. You’ll learn more about pivoting in later chapters.

11.6.8.2 Creating a Basic Heatmap

flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")
flights_df
# you will learn pivot in the later chapters
year 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
month
Jan 112 115 145 171 196 204 242 284 315 340 360 417
Feb 118 126 150 180 196 188 233 277 301 318 342 391
Mar 132 141 178 193 236 235 267 317 356 362 406 419
Apr 129 135 163 181 235 227 269 313 348 348 396 461
May 121 125 172 183 229 234 270 318 355 363 420 472
Jun 135 149 178 218 243 264 315 374 422 435 472 535
Jul 148 170 199 230 264 302 364 413 465 491 548 622
Aug 148 170 199 242 272 293 347 405 467 505 559 606
Sep 136 158 184 209 237 259 312 355 404 404 463 508
Oct 119 133 162 191 211 229 274 306 347 359 407 461
Nov 104 114 146 172 180 203 237 271 305 310 362 390
Dec 118 140 166 194 201 229 278 306 336 337 405 432

flights_df is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the sns.heatmap function to visualize the footfall at the airport.

plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df);

Reading the Heatmap:

Brighter colors indicate higher passenger counts. From this visualization we can see:

Key Patterns Observed:

  1. Seasonal Pattern (Vertical):

    • Footfall is highest in July & August (summer months) - brightest colors
    • Lowest in winter months (November-February) - darker colors
  2. Growth Trend (Horizontal):

    • Passenger numbers increase year over year
    • Each year shows generally brighter colors than the previous

11.6.8.3 Enhanced Heatmap with Annotations

Add actual numbers and customize colors for clarity:

# fmt = "d" decimal integer. output are the number in base 10
plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df, fmt="d", annot=True, cmap='Blues');

Enhancements:

  • annot=True displays actual values in each cell
  • fmt="d" formats numbers as integers (no decimals)
  • cmap='Blues' uses a blue color scheme (darker = more passengers)

Much clearer! Now we can see exact passenger counts while still benefiting from the color visualization.

11.6.9 Step 9: Correlation Matrices

One of the most powerful uses of heatmaps is visualizing correlation matrices.

11.6.9.1 What is a Correlation Matrix?

A correlation matrix is a special heatmap where:

  • Values represent correlation coefficients between pairs of variables
  • Range from -1 (perfect negative correlation) to +1 (perfect positive correlation)
  • Shows strength and direction of linear relationships

Visual Guide:

  • Dark red/positive = Variables increase together
  • Dark blue/negative = One increases as other decreases
  • Light colors near 0 = Little to no linear relationship

Important Note: In this course, correlation always refers to Pearson’s correlation coefficient, which measures linear association between variables.

Critical Caveat: Correlation does NOT imply causation! A strong correlation means variables move together, but doesn’t tell us if one causes the other.

11.6.9.2 Computing Correlations with Pandas

Pandas provides built-in functions for correlation analysis:

Pandas Correlation Functions:

  • .corr() - Computes pairwise correlation between all numeric columns of a DataFrame
  • .corrwith() - Computes correlation of DataFrame columns with another DataFrame or Series

Let’s load a survey dataset to explore correlations:

#Pairwise correlation amongst all columns
survey_data = pd.read_csv('./Datasets/survey_data_clean.csv')

survey_data.head()
Timestamp fav_alcohol parties_per_month smoke weed introvert_extrovert love_first_sight learning_style left_right_brained personality_type ... used_python_before dominant_hand childhood_in_US gender region_of_residence political_affliation cant_change_math_ability can_change_math_ability math_is_genetic much_effort_is_lack_of_talent
0 2022/09/13 1:43:34 pm GMT-5 I don't drink 1.0 No Occasionally Introvert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... INFJ ... 1 Right 1 Female Northeast Democrat 0 1 0 0
1 2022/09/13 5:28:17 pm GMT-5 Hard liquor/Mixed drink 3.0 No Occasionally Extrovert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ESFJ ... 1 Right 1 Male West Democrat 0 1 0 0
2 2022/09/13 7:56:38 pm GMT-5 Hard liquor/Mixed drink 3.0 No Yes Introvert 0 Kinesthetic (learn best through figuring out h... Left-brained (logic, science, critical thinkin... ISTJ ... 0 Right 0 Female International No affiliation 0 1 0 0
3 2022/09/13 10:34:37 pm GMT-5 Hard liquor/Mixed drink 12.0 No No Extrovert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ENFJ ... 0 Right 1 Female Southeast Democrat 0 1 0 0
4 2022/09/14 4:46:19 pm GMT-5 I don't drink 1.0 No No Extrovert 1 Reading/Writing (learn best through words ofte... Right-brained (creative, art, imaginative, int... ENTJ ... 0 Right 1 Female Northeast Democrat 1 0 0 0

5 rows × 51 columns

Compute the pairwise correlation matrix:

#Pairwise correlation amongst all columns
survey_data.select_dtypes(include='number').corr()
parties_per_month love_first_sight num_insta_followers expected_marriage_age expected_starting_salary minutes_ex_per_week sleep_hours_per_day farthest_distance_travelled fav_number internet_hours_per_day ... procrastinator num_clubs student_athlete AP_stats used_python_before childhood_in_US cant_change_math_ability can_change_math_ability math_is_genetic much_effort_is_lack_of_talent
parties_per_month 1.000000 0.096129 0.239705 -0.064079 0.114881 0.195561 -0.052542 -0.017081 -0.050139 0.087390 ... -0.056871 -0.010514 0.290830 -0.013222 -0.040033 0.081905 -0.052912 0.055575 -0.013374 -0.029838
love_first_sight 0.096129 1.000000 -0.024010 -0.084406 0.080138 0.099244 -0.025378 -0.075539 0.105095 -0.007652 ... 0.033951 0.083342 0.014595 -0.062992 -0.034692 -0.118260 0.005254 0.020758 -0.003710 0.013376
num_insta_followers 0.239705 -0.024010 1.000000 -0.130157 0.127226 0.099341 -0.042421 0.011308 -0.124763 -0.028427 ... -0.089871 0.265958 0.044807 0.005947 -0.016201 0.072622 -0.150658 0.130774 -0.018411 -0.165899
expected_marriage_age -0.064079 -0.084406 -0.130157 1.000000 -0.014881 -0.088073 0.182009 -0.024038 -0.008924 -0.029772 ... -0.020012 -0.137069 -0.036122 0.010447 0.052727 0.053759 -0.072163 0.087633 -0.086898 0.052813
expected_starting_salary 0.114881 0.080138 0.127226 -0.014881 1.000000 0.134065 -0.005078 -0.028329 -0.028125 0.017479 ... 0.054273 -0.100922 -0.026219 -0.084894 -0.094541 0.081142 -0.011609 0.019171 0.078694 0.097265
minutes_ex_per_week 0.195561 0.099244 0.099341 -0.088073 0.134065 1.000000 0.049593 -0.153188 0.038758 -0.028457 ... -0.045149 -0.024572 0.576301 -0.062544 0.057760 0.235492 -0.101282 0.134430 -0.047772 -0.045141
sleep_hours_per_day -0.052542 -0.025378 -0.042421 0.182009 -0.005078 0.049593 1.000000 0.104175 -0.021909 0.017435 ... -0.176579 -0.163860 0.058361 -0.013909 0.096528 -0.059468 -0.058086 0.012174 0.027052 -0.022025
farthest_distance_travelled -0.017081 -0.075539 0.011308 -0.024038 -0.028329 -0.153188 0.104175 1.000000 -0.108661 0.049450 ... 0.032492 -0.045214 -0.158027 0.010580 0.012353 -0.282821 -0.046074 0.017935 0.110037 0.046895
fav_number -0.050139 0.105095 -0.124763 -0.008924 -0.028125 0.038758 -0.021909 -0.108661 1.000000 -0.013070 ... 0.085508 -0.013696 -0.014435 0.091011 0.030736 0.072894 -0.032534 0.034319 -0.063692 -0.073777
internet_hours_per_day 0.087390 -0.007652 -0.028427 -0.029772 0.017479 -0.028457 0.017435 0.049450 -0.013070 1.000000 ... 0.048239 0.064527 -0.017944 0.001818 0.051970 0.033120 -0.033902 0.050258 0.190205 -0.053708
only_child -0.142519 0.124345 -0.152184 -0.043141 -0.088648 -0.123371 0.038126 0.214377 -0.024419 -0.035022 ... 0.073415 -0.065484 0.064136 0.048031 -0.139898 -0.387711 0.023089 -0.019982 0.058226 0.092372
num_majors_minors -0.073127 0.108730 0.050431 -0.055280 0.021278 0.044450 -0.024339 -0.012779 0.023903 -0.073775 ... -0.073806 0.311266 -0.035500 -0.068640 -0.073388 -0.153529 -0.077501 0.024734 -0.125809 -0.064939
high_school_GPA 0.295646 0.069288 0.147402 0.017052 0.053354 -0.076471 -0.036904 -0.064116 -0.023081 -0.034485 ... 0.031561 -0.020854 0.006332 0.066837 0.072777 0.005606 -0.095025 0.093416 -0.082620 0.001373
NU_GPA -0.080548 -0.114041 0.004702 0.011925 -0.048069 -0.108177 0.143997 0.038238 -0.307656 -0.014531 ... -0.269552 0.016724 -0.027378 -0.026544 -0.008536 -0.028968 0.002094 -0.137330 0.036731 0.047840
age -0.032771 0.142384 -0.230698 0.060416 -0.102632 -0.040906 -0.035890 0.018811 0.096818 0.017515 ... -0.005892 -0.127760 -0.038315 -0.026959 0.009924 -0.152784 -0.005954 0.014759 -0.009315 -0.126370
height -0.005405 0.216072 0.009318 0.044577 0.151517 0.182090 -0.010650 -0.235067 0.041298 -0.023174 ... 0.063263 0.212038 0.080953 0.022484 0.016110 0.160309 -0.055641 0.101811 -0.064383 0.028509
height_father 0.126741 0.029419 0.179684 0.026949 0.011450 0.156227 0.097593 -0.118669 -0.032717 -0.047314 ... -0.111183 0.022701 0.155003 -0.010982 -0.006480 0.137934 -0.019593 0.008157 0.010222 0.060439
height_mother 0.079121 0.082684 0.129716 0.075316 0.033947 0.114181 -0.044089 -0.134582 -0.029568 -0.091417 ... -0.078265 0.091390 0.053258 -0.100647 -0.021396 0.119292 0.027120 0.034961 -0.035449 0.074492
procrastinator -0.056871 0.033951 -0.089871 -0.020012 0.054273 -0.045149 -0.176579 0.032492 0.085508 0.048239 ... 1.000000 0.078341 0.094363 0.003053 -0.016254 -0.090868 0.002462 0.084419 -0.001738 0.081471
num_clubs -0.010514 0.083342 0.265958 -0.137069 -0.100922 -0.024572 -0.163860 -0.045214 -0.013696 0.064527 ... 0.078341 1.000000 -0.084562 0.087438 0.115062 -0.021044 -0.136249 0.070002 -0.090570 -0.108851
student_athlete 0.290830 0.014595 0.044807 -0.036122 -0.026219 0.576301 0.058361 -0.158027 -0.014435 -0.017944 ... 0.094363 -0.084562 1.000000 -0.040686 -0.049288 0.082888 -0.066667 -0.022576 -0.060523 0.121232
AP_stats -0.013222 -0.062992 0.005947 0.010447 -0.084894 -0.062544 -0.013909 0.010580 0.091011 0.001818 ... 0.003053 0.087438 -0.040686 1.000000 0.089517 0.106584 0.081109 0.029743 -0.048375 -0.018043
used_python_before -0.040033 -0.034692 -0.016201 0.052727 -0.094541 0.057760 0.096528 0.012353 0.030736 0.051970 ... -0.016254 0.115062 -0.049288 0.089517 1.000000 0.041928 -0.011217 0.156806 0.088566 0.023366
childhood_in_US 0.081905 -0.118260 0.072622 0.053759 0.081142 0.235492 -0.059468 -0.282821 0.072894 0.033120 ... -0.090868 -0.021044 0.082888 0.106584 0.041928 1.000000 -0.008575 0.057185 -0.178003 -0.013098
cant_change_math_ability -0.052912 0.005254 -0.150658 -0.072163 -0.011609 -0.101282 -0.058086 -0.046074 -0.032534 -0.033902 ... 0.002462 -0.136249 -0.066667 0.081109 -0.011217 -0.008575 1.000000 -0.672777 0.294544 0.101835
can_change_math_ability 0.055575 0.020758 0.130774 0.087633 0.019171 0.134430 0.012174 0.017935 0.034319 0.050258 ... 0.084419 0.070002 -0.022576 0.029743 0.156806 0.057185 -0.672777 1.000000 -0.361546 -0.131047
math_is_genetic -0.013374 -0.003710 -0.018411 -0.086898 0.078694 -0.047772 0.027052 0.110037 -0.063692 0.190205 ... -0.001738 -0.090570 -0.060523 -0.048375 0.088566 -0.178003 0.294544 -0.361546 1.000000 0.154083
much_effort_is_lack_of_talent -0.029838 0.013376 -0.165899 0.052813 0.097265 -0.045141 -0.022025 0.046895 -0.073777 -0.053708 ... 0.081471 -0.108851 0.121232 -0.018043 0.023366 -0.013098 0.101835 -0.131047 0.154083 1.000000

28 rows × 28 columns

The matrix is hard to read! Let’s find which features correlate most with NU_GPA:

survey_data.select_dtypes(include='number').corrwith(survey_data.NU_GPA).sort_values(ascending = False)
NU_GPA                           1.000000
sleep_hours_per_day              0.143997
num_majors_minors                0.141988
only_child                       0.106440
much_effort_is_lack_of_talent    0.047840
farthest_distance_travelled      0.038238
math_is_genetic                  0.036731
num_clubs                        0.016724
expected_marriage_age            0.011925
num_insta_followers              0.004702
cant_change_math_ability         0.002094
used_python_before              -0.008536
internet_hours_per_day          -0.014531
AP_stats                        -0.026544
student_athlete                 -0.027378
childhood_in_US                 -0.028968
high_school_GPA                 -0.030883
height_father                   -0.040120
expected_starting_salary        -0.048069
age                             -0.052039
height_mother                   -0.079276
parties_per_month               -0.080548
height                          -0.099082
minutes_ex_per_week             -0.108177
love_first_sight                -0.114041
can_change_math_ability         -0.137330
procrastinator                  -0.269552
fav_number                      -0.307656
dtype: float64

Better! Now we can see correlations with NU_GPA sorted from strongest to weakest.

11.6.9.3 Visualizing the Correlation Matrix

Now let’s create a heatmap to see all correlations at once:

sns.set(rc={'figure.figsize':(12,10)})
sns.heatmap(survey_data.select_dtypes(include='number').corr());

Key Findings from the Correlation Heatmap:

  1. Strong Positive Correlation:

    • student_athlete is strongly positively correlated with minutes_ex_per_week
    • This makes sense: student athletes exercise more
  2. Strong Negative Correlation:

    • procrastinator is strongly negatively correlated with NU_GPA
    • Students who procrastinate tend to have lower GPAs
  3. Diagonal:

    • All 1.0 values (variables perfectly correlate with themselves)
  4. Symmetry:

    • Matrix is symmetric: correlation(A, B) = correlation(B, A)

11.6.10 Interpreting Correlation Coefficients

Understanding what correlation values mean:

Coefficient Range Interpretation Relationship Strength
0.9 to 1.0 Very strong positive Nearly perfect linear relationship
0.7 to 0.9 Strong positive Clear linear pattern
0.5 to 0.7 Moderate positive Noticeable but imperfect pattern
0.3 to 0.5 Weak positive Slight tendency
-0.3 to 0.3 Little to none No linear relationship
-0.5 to -0.3 Weak negative Slight inverse tendency
-0.7 to -0.5 Moderate negative Noticeable inverse pattern
-0.9 to -0.7 Strong negative Clear inverse linear pattern
-1.0 to -0.9 Very strong negative Nearly perfect inverse relationship

Critical Caveats:

  1. Correlation ≠ Causation

    • Strong correlation doesn’t mean one variable causes changes in the other
    • Example: Ice cream sales and drowning deaths correlate (both happen in summer), but ice cream doesn’t cause drowning!
  2. Linear Only

    • Pearson correlation only captures linear relationships
    • Variables might have strong non-linear relationships with correlation near 0
  3. Outliers Matter

    • A few extreme values can heavily influence correlation coefficients
    • Always visualize your data, don’t just look at numbers
  4. Hidden Variables

    • Third variables (confounders) might explain apparent correlations
    • Consider lurking variables before drawing conclusions

11.6.11 Summary: Seaborn Best Practices

Now that you’ve learned Seaborn, here are essential best practices:

11.6.11.1 When to Use Seaborn

Best For:

  • Statistical visualizations (distributions, relationships, comparisons)
  • Quick, beautiful plots with minimal code
  • Exploring relationships in DataFrames
  • Creating publication-ready figures with default settings
  • Plots with categorical and numerical data together

Not Ideal For:

  • Highly customized, non-standard plot types (use Matplotlib)
  • Interactive visualizations (use Plotly or Bokeh)
  • 3D plots (use Matplotlib’s mplot3d or Plotly)
  • Very simple exploratory plots (Pandas might be faster)

11.6.11.2 Essential Seaborn Workflow

1. Set Aesthetics First:

sns.set_style("whitegrid")      # Professional appearance
sns.set_context("notebook")     # Appropriate sizing
sns.set_palette("colorblind")   # Accessible colors

2. Use the data Parameter:

# Recommended: Clear and readable
sns.scatterplot(data=df, x='col1', y='col2', hue='col3')

# Avoid: Harder to read and modify
sns.scatterplot(x=df['col1'], y=df['col2'], hue=df['col3'])

3. Leverage hue for Multi-Dimensional Plots:

  • Adds a third dimension via color
  • Works across most plot types
  • Automatically generates legends

4. Combine with Matplotlib for Fine-Tuning:

sns.boxplot(data=df, x='category', y='value')
plt.title('Custom Title')           # Matplotlib customization
plt.ylabel('Custom Y Label')
plt.xticks(rotation=45)

5. Choose the Right Plot Type:

Goal Plot Type Seaborn Function
Single variable distribution Histogram histplot()
Smooth distribution kdeplot()
Compare categories Bar chart barplot()
Box plot boxplot()
Two numeric variables Scatter scatterplot()
Correlations Heatmap heatmap()
All pairwise relationships Pair plot pairplot()

11.6.11.3 Color Palette Guide

For Categorical Data (no order):

  • "deep" - default, good contrast
  • "colorblind" - accessible (highly recommended!)
  • "Set2", "Set3" - soft, professional

For Sequential Data (low to high):

  • "Blues", "Greens", "Reds" - single hue
  • "viridis", "plasma", "cividis" - perceptually uniform

For Diverging Data (meaningful midpoint):

  • "coolwarm" - blue to red through white
  • "RdBu" - red to blue
  • "vlag" - light to dark to light

11.6.12 Practice Exercises

Now apply what you’ve learned!

Exercise 1: Distribution Analysis

# Using the iris dataset:
# 1. Create a histogram of petal_length with KDE
# 2. Add hue by species
# 3. Add a title and customize figure size

Exercise 2: Categorical Comparison

# Using the tips dataset:
# 1. Create a box plot comparing total_bill across different times (Lunch vs Dinner)
# 2. Add hue by sex
# 3. Make it horizontal

Exercise 3: Correlation Analysis

# Using the iris dataset:
# 1. Compute the correlation matrix for all numeric columns
# 2. Create a heatmap with annotations
# 3. Use the 'coolwarm' color palette
# 4. What are the two most correlated features?

Exercise 4: Multi-Dimensional Scatter

# Using the tips dataset:
# 1. Create a scatter plot of total_bill vs tip
# 2. Use hue for time (Lunch/Dinner)
# 3. Use size for the party size
# 4. Add appropriate labels and title

Hint: Refer back to the examples in this section and the Seaborn documentation for parameter details!

11.7 Chapter Summary

Congratulations! You’ve completed a comprehensive journey through data visualization in Python. Let’s recap what you’ve accomplished.

11.7.1 Visualization Fundamentals

Data Understanding:

  • Distinguishing numeric vs. categorical data
  • Univariate, bivariate, and multivariate analysis
  • Matching data types to visualization types

11.7.2 Complete Plot Type Reference

Here’s a master reference of all plot types you’ve learned:

Plot Type Purpose Pandas Matplotlib Seaborn
Line Plot Trends over time/continuous data df.plot() plt.plot() sns.lineplot()
Scatter Plot Relationship between two variables df.plot(kind='scatter') plt.scatter() sns.scatterplot()
Histogram Distribution of single variable df.plot(kind='hist') plt.hist() sns.histplot()
Bar Chart Compare categories df.plot(kind='bar') plt.bar() sns.barplot()
Box Plot Distribution summary + outliers df.plot(kind='box') plt.boxplot() sns.boxplot()
Pie Chart Part-to-whole relationships df.plot(kind='pie') plt.pie() N/A
Heatmap 2D data matrix/correlations N/A plt.imshow() sns.heatmap()
KDE Plot Smooth distribution df.plot(kind='kde') N/A sns.kdeplot()

11.7.2.1 The Three-Library Ecosystem

You’ve mastered Python’s three-tier approach to visualization:

┌─────────────────────────────────────┐
│  Pandas .plot()                     │  ← Quick & Easy
├─────────────────────────────────────┤
│  Seaborn                            │  ← Beautiful & Statistical  
├─────────────────────────────────────┤
│  Matplotlib pyplot                  │  ← Complete Control
└─────────────────────────────────────┘

11.7.3 When to Use Each Library

Is this exploratory analysis?
├─ Yes → Use Pandas .plot() for speed
└─ No
   ├─ Need statistical visualization?
   │  ├─ Yes → Use Seaborn
   │  └─ No → Continue below
   └─ Need custom/complex plot?
      ├─ Yes → Use Matplotlib
      └─ No → Use Seaborn (easier syntax)

11.7.4 Essential Best Practices

11.7.4.1 Universal Guidelines

  1. Always Label Your Axes

    • Include units: Temperature (°C) not just Temperature
    • Make labels descriptive and self-explanatory
  2. Add Informative Titles

    • Describe what the plot shows
    • Answer: What am I looking at?
  3. Choose Appropriate Colors

    • Use colorblind-friendly palettes ("colorblind" in Seaborn)
    • Avoid red-green combinations
    • Ensure sufficient contrast
  4. Control Figure Size

    # Pandas
    df.plot(figsize=(12, 6))
    
    # Matplotlib
    plt.figure(figsize=(12, 6))
    
    # Seaborn (use Matplotlib)
    plt.figure(figsize=(12, 6))
    sns.scatterplot(data=df, x='x', y='y')
  5. Save High-Quality Figures

    plt.savefig('figure.png', dpi=300, bbox_inches='tight')

11.7.4.2 Library-Specific Tips

Pandas:

  • Set time column as index for time series plots
  • Use rot=45 to rotate axis labels
  • Chain .plot() with other DataFrame operations

Matplotlib:

  • Use semicolons (;) in Jupyter to suppress unwanted output
  • Call plt.figure() before creating a new plot
  • Use fmt shorthand for quick styling: 'o-r' = circles, solid line, red

Seaborn:

  • Always set style first: sns.set_style("whitegrid")
  • Use data parameter with column names (not Series)
  • Leverage hue for multi-dimensional visualization
  • Combine with Matplotlib for fine-tuning

11.7.5 Pro Tip: Combine Libraries!

The most powerful approach is mixing libraries in a single plot:

# Create beautiful plot with Seaborn
sns.scatterplot(data=df, x='x', y='y', hue='category')

# Fine-tune with Matplotlib
plt.title('Custom Title', fontsize=16, fontweight='bold')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()

# Works seamlessly!

This gives you:

  • Seaborn’s easy syntax and beautiful defaults
  • Matplotlib’s precise customization
  • Best of both worlds!

11.8 Independent Study

11.8.1 Practice exercise 1

Read survey_data_clean.csv

Timestamp fav_alcohol parties_per_month smoke weed introvert_extrovert love_first_sight learning_style left_right_brained personality_type ... used_python_before dominant_hand childhood_in_US gender region_of_residence political_affliation cant_change_math_ability can_change_math_ability math_is_genetic much_effort_is_lack_of_talent
0 2022/09/13 1:43:34 pm GMT-5 I don't drink 1.0 No Occasionally Introvert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... INFJ ... 1 Right 1 Female Northeast Democrat 0 1 0 0
1 2022/09/13 5:28:17 pm GMT-5 Hard liquor/Mixed drink 3.0 No Occasionally Extrovert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ESFJ ... 1 Right 1 Male West Democrat 0 1 0 0
2 2022/09/13 7:56:38 pm GMT-5 Hard liquor/Mixed drink 3.0 No Yes Introvert 0 Kinesthetic (learn best through figuring out h... Left-brained (logic, science, critical thinkin... ISTJ ... 0 Right 0 Female International No affiliation 0 1 0 0
3 2022/09/13 10:34:37 pm GMT-5 Hard liquor/Mixed drink 12.0 No No Extrovert 0 Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ENFJ ... 0 Right 1 Female Southeast Democrat 0 1 0 0
4 2022/09/14 4:46:19 pm GMT-5 I don't drink 1.0 No No Extrovert 1 Reading/Writing (learn best through words ofte... Right-brained (creative, art, imaginative, int... ENTJ ... 0 Right 1 Female Northeast Democrat 1 0 0 0

5 rows × 51 columns

How does the expected marriage age of the people of STAT303-1 depend on their characteristics? We’ll use visualizations to answer this question.

11.8.1.1

Make a visualization that compares the mean expected_marriage_age of introverts and extroverts (use the variable introvert_extrovert). What insights do you obtain?

11.8.1.2

Does the mean expected_marriage_age of introverts and extroverts depend on whether they believe in love in first sight (variable name: love_first_sight)? Update the previous visualization to answer the question.

11.8.1.3

In addition to love_first_sight, does the mean expected_marriage_age of introverts and extroverts depend on whether they are a procrastinator (variable name: procrastinator)? Update the previous visualization to answer the question.

11.8.1.4

Is there any critical information missing in the above visualizations that, if revealed, may cast doubts on the patterns observed in them?

11.8.2 Practice exercise 2

Read Australia_weather.csv,

Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
0 10/20/2010 Sydney 12.9 20.3 0.2 3.0 10.9 ENE 37 W ... 57 1028.8 1025.6 3 1 16.9 19.8 No 0.0 No
1 10/21/2010 Sydney 13.3 21.5 0.0 6.6 11.0 ENE 41 W ... 58 1025.9 1022.4 2 5 17.6 21.3 No 0.0 No
2 10/22/2010 Sydney 15.3 23.0 0.0 5.6 11.0 NNE 41 W ... 63 1021.4 1017.8 1 4 19.0 22.2 No 0.0 No
3 10/26/2010 Sydney 12.9 26.7 0.2 3.8 12.1 NE 33 W ... 56 1018.0 1015.0 1 5 17.8 22.5 No 0.0 No
4 10/27/2010 Sydney 14.8 23.8 0.0 6.8 9.6 SSE 54 SSE ... 69 1016.0 1014.7 2 7 20.2 20.6 No 1.8 Yes

5 rows × 24 columns

11.8.2.1

Create a histogram showing the distributions of maximum temperature in Sydney, Canberra and Melbourne.

11.8.2.2

Make a density plot showing the distributions of maximum temperature in Sydney, Canberra and Melbourne.

11.8.2.3

Show the distributions of the maximum and minimum temperatures across all locations in a single plot.

11.8.2.4

Create a scatter plot with a trendline for MinTemp and MaxTemp, including a confidence interval.

Hint: Using Seaborn, the regplot() function enables us to overlay a trendline on the scatter plot, complete with a 95% confidence interval for the trendline